
[None][perf] Add GreenContext SM-partitioned overlap for MoE DenseGEMM FC1+Router #12802

Draft

JacobHu-NV wants to merge 15 commits into NVIDIA:main from JacobHu-NV:pr/densegemm-as-moe-overlap

Conversation

@JacobHu-NV

Summary

  • Add green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams,
    create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that create
    streams bound to green contexts via cuGreenCtxStreamCreate (see the sketch
    after this list). Unlike streams created inside GreenContext.set_context(),
    these survive CUDA Graph capture/replay with their SM partition intact.
  • Add DenseGEMMGCSMRunner (TunableRunner) in fused_moe_densegemm.py that sweeps
    FC1 SM count candidates via the AutoTuner framework to find the optimal SM split for
    FC1/Router overlap.
  • Extend DenseGEMMFusedMoE with a _gc_stream_pool pre-created at init time for all
    candidate SM splits, enabling CUDA-graph-safe autotuning without re-creating
    GreenContext streams at runtime.
  • Add sm_budget parameter to CuteDSLNVFP4DenseGemmSwigluRunner in
    cute_dsl_custom_ops.py (excluded from unique_id so inner tuning is shared across
    GC splits); register new custom ops cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell,
    cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.
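
For orientation, here is a minimal sketch of the Driver API sequence behind a helper like create_sm_only_gc_streams, assuming the cuda-python bindings. The driver entry points (cuDeviceGetDevResource, cuDevSmResourceSplitByCount, cuDevResourceGenerateDesc, cuGreenCtxCreate, cuGreenCtxStreamCreate) are real CUDA 12.4+ green-context APIs, but the exact in/out argument ordering of the Python bindings shown here is an assumption, as is the helper's real signature:

```python
from cuda import cuda  # cuda-python low-level Driver API bindings


def _check(ret):
    # cuda-python calls return (CUresult, *outputs) tuples.
    err, *out = ret
    if err != cuda.CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver error: {err}")
    return out[0] if len(out) == 1 else tuple(out)


def make_gc_stream(device_ordinal: int, sm_count: int):
    """Sketch: one stream bound to a green context owning `sm_count` SMs."""
    _check(cuda.cuInit(0))  # no-op if the process already initialized CUDA
    dev = _check(cuda.cuDeviceGet(device_ordinal))
    # Query the device's full SM resource.
    sm_res = _check(cuda.cuDeviceGetDevResource(
        dev, cuda.CUdevResourceType.CU_DEV_RESOURCE_TYPE_SM))
    # Split off one group of at least `sm_count` SMs; `remaining` can back
    # the complementary partition (e.g. the Router side). Binding argument
    # order is assumed here.
    groups, _, remaining = _check(cuda.cuDevSmResourceSplitByCount(
        1, sm_res, 0, sm_count))
    desc = _check(cuda.cuDevResourceGenerateDesc([groups[0]], 1))
    gctx = _check(cuda.cuGreenCtxCreate(
        desc, dev, cuda.CUgreenCtxCreate_flags.CU_GREEN_CTX_DEFAULT_STREAM))
    # Unlike streams made inside GreenContext.set_context(), a stream created
    # with cuGreenCtxStreamCreate keeps its SM partition across CUDA Graph
    # capture/replay.
    stream = _check(cuda.cuGreenCtxStreamCreate(
        gctx, cuda.CUstream_flags.CU_STREAM_NON_BLOCKING, 0))
    return stream, gctx, remaining
```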

Motivation

The DenseGEMM MoE path overlaps the FC1 and Router GEMMs to hide router latency.
Earlier attempts used soft sm_budget hints (max_active_clusters), which do not
prevent SM contention at the hardware level. CUDA GreenContext provides true
hardware SM isolation: FC1 and Router CTAs are dispatched to disjoint SM
partitions with no interference.

peaceh-nv and others added 10 commits March 29, 2026 18:47
Add a CuTe DSL BF16 persistent GEMM kernel as an alternative BMM
implementation for MLA (Multi-head Latent Attention) on Blackwell GPUs.
Gated behind the `use_cute_dsl_bf16_bmm` flag and `is_sm_100f()` so it
has zero impact on existing code paths when disabled.

New files:
- dense_gemm_persistent.py: Blackwell SM100 warp-specialized kernel with
  TMA loads, TMEM accumulators, and TMA store epilogue. Adapted from
  CUTLASS example with API compatibility fixes for the installed DSL.

Integration:
- CuteDSLBf16BlackwellBmmRunner + trtllm::cute_dsl_bf16_bmm_blackwell op
  in cute_dsl_custom_ops.py with AutoTuner tactic selection.
- use_cute_dsl_bf16_bmm config plumbed through LlmArgs -> ModelConfig ->
  model_loader -> MLA attention (6 BMM call sites: k_b_proj and v_b_proj
  in generation, context, and sparse-MLA paths).
- --use_cute_dsl_bf16_bmm CLI flag in quickstart_advanced.py.
- Integration tests: single-GPU and 4-GPU (tp4/ep4) accuracy tests with
  GSM8K for DeepSeek-V3-Lite BF16 in test_llm_api_pytorch.py.

Non-contiguous tensor handling: the runner makes inputs contiguous before
extracting data pointers since the kernel layout assumes contiguous [B,M,K].
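
As a concrete illustration of that handling (tensor shapes and names are placeholders, not the runner's actual code):

```python
import torch

# Placeholder input: `a` arrives as a non-contiguous view from attention.
a = torch.randn(8, 64, 128, device="cuda", dtype=torch.bfloat16).transpose(1, 2)
b = torch.randn(8, 128, 32, device="cuda", dtype=torch.bfloat16)

# The kernel builds TMA descriptors assuming dense [B, M, K] storage, so a
# strided view must be materialized before its raw pointer is taken.
if not a.is_contiguous():
    a = a.contiguous()   # full copy; later removed by the strided-A path
a_ptr, b_ptr = a.data_ptr(), b.data_ptr()
```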

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
When mma_inst_tile_k > 1, cute.gemm() generates multiple sub-MMA
instructions that all share the same ACCUMULATE flag. With
ACCUMULATE=False on the first K tile, every sub-MMA cleared the
accumulator, so only the last sub-MMA's result survived and the
contribution of (mma_inst_tile_k - 1) * mma_inst_shape_k K elements
per output tile was lost.

This caused GSM8K accuracy to drop from 64.7% to 28.5%.

Fix by adding an inner kblock loop that issues the sub-MMA instructions
individually and sets ACCUMULATE=True after the first cute.gemm() call,
matching the pattern used by blockscaled_contiguous_grouped_gemm.py (see
the sketch below).

GSM8K accuracy restored to 64.86% (reference: 64.74%).
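
For reference, a minimal sketch of the fixed inner loop. It is a fragment from inside a kernel mainloop; `cute`, `cutlass`, and `tcgen05` are the CUTLASS DSL modules the kernel already imports, and the tensor names follow the CUTLASS Blackwell examples rather than this kernel exactly:

```python
# Issue sub-MMA (kblock) instructions one at a time so the ACCUMULATE
# flag can be flipped after the very first one.
num_kblocks = cute.size(tCrA, mode=[2])   # = mma_inst_tile_k
for kblock in cutlass.range(num_kblocks, unroll_full=True):
    kcoord = (None, None, kblock)
    cute.gemm(tiled_mma, acc, tCrA[kcoord], tCrB[kcoord], acc)
    # From the second sub-MMA on, accumulate into `acc` instead of
    # overwriting it, preserving earlier kblocks' partial sums.
    tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
```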

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
…kwell

Add use_cute_dsl_bf16_gemm flag to enable CuTe DSL BF16 persistent GEMM
for unquantized Linear layers in MLA attention (kv_a_proj_with_mqa,
q_b_proj, kv_b_proj). This complements the existing BF16 BMM support.

Changes:
- Add CuteDSLBf16BlackwellGemmRunner class and custom op in cute_dsl_custom_ops.py
- Add use_cute_dsl_bf16_gemm parameter to Linear class and UnquantizedLinearMethod
- Wire use_cute_dsl_bf16_gemm through ModelConfig, LlmArgs, and model_loader
- Pass flag to MLA Linear layers in attention.py
- Add --use_cute_dsl_bf16_gemm CLI argument to quickstart_advanced.py
- Add integration tests for single GPU and 4 GPU configurations
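
A hypothetical usage sketch: since the commit wires use_cute_dsl_bf16_gemm through LlmArgs, the flag should be settable from the LLM API. The kwarg forwarding shown here is an assumption, as is the model path:

```python
from tensorrt_llm import LLM

# Hypothetical: use_cute_dsl_bf16_gemm is plumbed LlmArgs -> ModelConfig ->
# model_loader, so passing it here routes the MLA Linear layers through the
# CuTe DSL BF16 persistent GEMM on Blackwell.
llm = LLM(model="<path-to-DeepSeek-V3-Lite-bf16>",
          use_cute_dsl_bf16_gemm=True)
```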

Signed-off-by: Pei He <peih@nvidia.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
…GEMM (FP32 output)

Enable CuTe DSL BF16 GEMM kernel for DeepseekV3Gate router GEMM on Blackwell.
The router computes BF16 input @ BF16 weight -> FP32 logits, which our
persistent GEMM kernel already supports via FP32 accumulator and FP32 output.

Key changes:
- Support FP32 output dtype in CuteDSLBf16BlackwellGemmRunner (detect from
  output tensor instead of hardcoding BF16, add c_dtype to kernel cache key)
- Relax cute_dsl_bf16_gemm_blackwell custom op to accept BF16 or FP32 output
- Add CuTe DSL dispatch in DeepseekV3Gate.forward() gated by
  use_cute_dsl_bf16_gemm flag, with fallback to dsv3_router_gemm_op
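
A sketch of what that dispatch might look like. Only the op names come from this commit; the method name, signatures, and attribute names are illustrative:

```python
import torch

# Hypothetical shape of the new dispatch in DeepseekV3Gate.forward().
def _router_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
    if self.use_cute_dsl_bf16_gemm:
        # BF16 x BF16 -> FP32 logits via the persistent GEMM's FP32
        # accumulator and FP32 output path (illustrative signature).
        return torch.ops.trtllm.cute_dsl_bf16_gemm_blackwell(
            hidden_states, self.weight)
    # Fallback: the existing fused router GEMM op.
    return torch.ops.trtllm.dsv3_router_gemm_op(hidden_states, self.weight)
```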

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
… path

Add wrapper_strided to PersistentDenseGemmKernel that accepts explicit A
tensor strides, enabling non-contiguous views (e.g. from .transpose()) to
be passed directly to TMA without .contiguous() copies. Update the BMM
runner to compute and pass A strides instead of forcing contiguous tensors,
removing the direct_copy_kernel_cuda overhead between attention and BMM.
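
To make the stride handling concrete (illustrative shapes; the real runner derives these from the attention output):

```python
import torch

# A transpose produces a valid [B, M, K] logical view without moving data.
attn_out = torch.randn(8, 64, 128, device="cuda", dtype=torch.bfloat16)
a = attn_out.transpose(1, 2)        # logical [8, 128, 64], non-contiguous
assert not a.is_contiguous()

# wrapper_strided consumes these element strides directly when building the
# TMA descriptor, so the former `a = a.contiguous()` copy disappears.
b_stride, m_stride, k_stride = a.stride()
```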

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
…ction_core

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
…ed IDs

The previous entries lacked pytest parameter brackets, which wouldn't
match actual test node IDs. Expand to all 12 parametrized variants.
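
For context, a toy example of why the brackets matter (test and parameter names are illustrative):

```python
import pytest

@pytest.mark.parametrize("tp", [1, 4])
def test_bf16(tp):
    ...

# pytest node IDs are "test_file.py::test_bf16[1]" and
# "test_file.py::test_bf16[4]"; a list entry written as
# "test_file.py::test_bf16" matches neither, so every bracketed
# variant must be spelled out.
```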

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
…BMM code

Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Apply ruff format/lint fixes:
- Convert multi-line docstrings to single-line where appropriate (D200)
- Remove f-string prefix on strings without placeholders (F541)
- Remove unused import
- Use consistent double-quote docstrings instead of single-quotes
- Fix indentation in docstrings

Signed-off-by: Peace He <103117813+peaceh-nv@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…M FC1+Router

Introduce hardware-level SM isolation via CUDA GreenContext so that the
FC1 GEMM and Router GEMM can execute truly in parallel without SM
contention in the DenseGEMM MoE path.

Key changes:
- green_context.py: CUDA Driver API helpers (create_sm_only_gc_streams,
  create_wq_isolated_gc_streams, get_current_stream_gc_sm_count) that
  bypass PyTorch's GreenContext API to create cuGreenCtxStreamCreate-
  bound streams.  These streams survive CUDA Graph capture/replay with
  their SM partition intact, unlike streams created inside
  GreenContext.set_context().
- fused_moe_densegemm.py: Add DenseGEMMGCSMRunner (TunableRunner) that
  sweeps FC1 SM count candidates via the AutoTuner framework to find the
  optimal SM split for FC1/Router overlap.  Extend DenseGEMMFusedMoE with
  a _gc_stream_pool pre-created at init time for all candidate SM splits,
  enabling CUDA-graph-safe autotuning.
- cute_dsl_custom_ops.py: Add sm_budget parameter to
  CuteDSLNVFP4DenseGemmSwigluRunner (excluded from unique_id so inner
  tuning is shared across GC splits); register new custom ops
  cute_dsl_nvfp4_dynamic_dense_gemm_swiglu_blackwell,
  cute_dsl_bf16_bmm_blackwell, and cute_dsl_bf16_gemm_blackwell.
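
A minimal sketch of the pool idea. The class, candidate values, and the signature of create_sm_only_gc_streams (the helper named above) are all assumptions for illustration:

```python
# Hypothetical illustration of _gc_stream_pool: pre-create one stream pair
# per candidate FC1 SM count at init time, so autotuning under CUDA Graph
# capture only ever *selects* streams and never creates green contexts.
FC1_SM_CANDIDATES = (32, 64, 96, 112)   # illustrative candidate splits


class GCStreamPool:
    def __init__(self, total_sms: int):
        self._streams = {}
        for fc1_sms in FC1_SM_CANDIDATES:
            # FC1 runs on `fc1_sms` SMs; the Router GEMM runs on the
            # complementary partition (assumed helper signature).
            self._streams[fc1_sms] = create_sm_only_gc_streams(
                sm_counts=[fc1_sms, total_sms - fc1_sms])

    def get(self, fc1_sms: int):
        fc1_stream, router_stream = self._streams[fc1_sms]
        return fc1_stream, router_stream
```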

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
@JacobHu-NV force-pushed the pr/densegemm-as-moe-overlap branch from ce51764 to e9f45af on April 7, 2026 at 08:41
…GEMM FC2

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
…P, and no-overlap strategies

Refactor fused_moe_densegemm.py into three dedicated modules:
- fused_moe_densegemm_gc.py: GreenContext-based overlap
- fused_moe_densegemm_smp.py: SM-partitioned overlap
- fused_moe_densegemm_no_overlap.py: no-overlap baseline

Update configurable_moe.py, modeling_deepseekv3.py, cute_dsl_custom_ops.py,
fc2.py, and utils.py accordingly.
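
A hypothetical sketch of how a caller such as configurable_moe.py could select among the three modules. Only the module names come from this commit; the function, class name, and config values are invented for illustration:

```python
# Hypothetical strategy selection over the three new modules.
def select_densegemm_moe(overlap_mode: str):
    if overlap_mode == "green_context":
        from .fused_moe_densegemm_gc import DenseGEMMFusedMoE       # GC overlap
    elif overlap_mode == "sm_partition":
        from .fused_moe_densegemm_smp import DenseGEMMFusedMoE      # SM-partitioned
    else:
        from .fused_moe_densegemm_no_overlap import DenseGEMMFusedMoE  # baseline
    return DenseGEMMFusedMoE
```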

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>